In this section we examine how complaints vary across the five boroughs.
We begin with data visualization to get an overview.
# Count felony complaints per year and borough, then plot the borough totals
complaint %>%
  filter(level == "FELONY") %>%
  drop_na(borough) %>%
  group_by(year, borough) %>%
  dplyr::summarize(n_obs = n(), .groups = "drop") %>%
  # reorder() sorts the boroughs by their mean yearly count, largest first
  ggplot(aes(x = reorder(borough, -n_obs), y = n_obs, fill = reorder(borough, -n_obs))) +
  geom_bar(stat = 'identity') +
  labs(
    title = "Frequency of Felonies by Borough (2016-2022)",
    x = "Borough",
    y = "Frequency"
  ) +
  theme(legend.position = "none")

According to the plot, Brooklyn had the most felonies, followed by Manhattan. The Bronx and Queens had about the same number, and Staten Island had the fewest.
# Share of each year's felonies that falls in each borough
complaint %>%
  filter(level == "FELONY") %>%
  drop_na(borough) %>%
  group_by(year, borough) %>%
  # keep the result grouped by year so that sum(n_obs) below is a yearly total
  dplyr::summarize(n_obs = n(), .groups = "drop_last") %>%
  dplyr::mutate(percentage = n_obs / sum(n_obs)) %>%
  ggplot(aes(x = year, y = percentage, fill = borough)) +
  geom_bar(stat = 'identity') +
  labs(
    x = "Year",
    y = "Proportion",
    title = "Proportions of Felonies by Borough and Year",
    fill = "Borough"
  )
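The within-year proportion step can be checked on a small example. The sketch below uses a hypothetical toy data frame (`toy`, with made-up counts, not the NYPD data) and base R only; it divides each count by the total for its own year and confirms that the shares in every year sum to 1.

```r
# Hypothetical toy counts (not the NYPD data): felonies per year and borough
toy <- data.frame(
  year    = c(2016, 2016, 2016, 2017, 2017, 2017),
  borough = c("BRONX", "BROOKLYN", "QUEENS", "BRONX", "BROOKLYN", "QUEENS"),
  n_obs   = c(20, 50, 30, 10, 60, 30)
)

# Divide each count by the total for its own year
toy$percentage <- toy$n_obs / ave(toy$n_obs, toy$year, FUN = sum)

# Within each year the shares should sum to 1
year_totals <- tapply(toy$percentage, toy$year, sum)

print(toy)
print(year_totals)
```

If the yearly totals were not all 1, the denominator would have been computed over the wrong grouping, which is the main thing to watch for in this step.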

The proportion of felonies in each borough does not appear to have changed noticeably over the years.
We next use one-way ANOVA and Bonferroni-adjusted pairwise t tests to check whether monthly complaint counts and monthly felony rates differ across boroughs.
# One row per borough and calendar month: total complaints, felonies, and felony rate
anova_table =
  complaint %>%
  filter(borough %in% c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND")) %>%
  mutate(monthly = str_c(as.character(year), as.character(month), sep = "-")) %>%
  group_by(borough, monthly) %>%
  dplyr::summarise(
    n_obs = n(),
    n_felony = sum(level == "FELONY"),
    felony_rate = n_felony / n_obs,
    .groups = "drop")
# Mean number of complaints per month in each borough
anova_table %>%
  group_by(borough) %>%
  dplyr::summarise(
    monthly_cases = mean(n_obs)
  )
## # A tibble: 5 × 2
##   borough       monthly_cases
##   <fct>                 <dbl>
## 1 BRONX                 8316.
## 2 BROOKLYN             11060.
## 3 MANHATTAN             9423.
## 4 QUEENS                7843.
## 5 STATEN ISLAND         1656.
res_monthly_case = aov(n_obs ~ factor(borough), data = anova_table)
summary(res_monthly_case)
##                  Df    Sum Sq   Mean Sq F value Pr(>F)
## factor(borough)   4 4.146e+09 1.036e+09    1345 <2e-16 ***
## Residuals       400 3.083e+08 7.708e+05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pairwise.t.test(anova_table$n_obs, anova_table$borough, p.adjust.method = 'bonferroni')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: anova_table$n_obs and anova_table$borough
##
##               BRONX   BROOKLYN MANHATTAN QUEENS
## BROOKLYN      < 2e-16 -        -         -
## MANHATTAN     1.2e-13 < 2e-16  -         -
## QUEENS        0.0067  < 2e-16  < 2e-16   -
## STATEN ISLAND < 2e-16 < 2e-16  < 2e-16   < 2e-16
##
## P value adjustment method: bonferroni
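The ANOVA rejects the hypothesis that the five boroughs have equal mean monthly cases (F = 1345, p < 2e-16), and the Bonferroni-adjusted pairwise t tests further show that every borough differs from every other at the 0.05 level (the largest adjusted p-value is 0.0067, for the Bronx versus Queens). Ranked by mean monthly cases from highest to lowest, the boroughs are Brooklyn, Manhattan, the Bronx, Queens, and Staten Island.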
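To make the F value concrete: it is the between-group mean square divided by the within-group (residual) mean square. In the table above, 1.036e+09 / 7.708e+05 is roughly 1344, which agrees with the reported 1345 up to rounding of the displayed values. The sketch below reproduces that calculation on hypothetical simulated data (three made-up groups `A`, `B`, `C`, not the NYPD data) and compares it with the value returned by `aov()`.

```r
# Hypothetical simulated monthly counts for three groups (not the NYPD data)
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 12)),
  n_obs = c(rnorm(12, mean = 100, sd = 10),
            rnorm(12, mean = 120, sd = 10),
            rnorm(12, mean = 60,  sd = 10))
)

# F statistic by hand: between-group mean square over within-group mean square
grand_mean  <- mean(toy$n_obs)
group_means <- tapply(toy$n_obs, toy$group, mean)
group_sizes <- tapply(toy$n_obs, toy$group, length)
ss_between  <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within   <- sum((toy$n_obs - group_means[as.character(toy$group)])^2)
df_between  <- nlevels(toy$group) - 1
df_within   <- nrow(toy) - nlevels(toy$group)
f_manual    <- (ss_between / df_between) / (ss_within / df_within)

# The same statistic as reported by aov()
fit   <- aov(n_obs ~ group, data = toy)
f_aov <- summary(fit)[[1]][["F value"]][1]

print(c(manual = f_manual, aov = f_aov))
```

The two printed values should agree, which confirms that the `Df`, `Sum Sq`, and `Mean Sq` columns of the ANOVA table are all that is needed to recover `F value`.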
anova_table %>%
  group_by(borough) %>%
  dplyr::summarise(
    mean_felony_rate = mean(felony_rate)
  )
## # A tibble: 5 × 2
##   borough       mean_felony_rate
##   <fct>                    <dbl>
## 1 BRONX                    0.296
## 2 BROOKLYN                 0.330
## 3 MANHATTAN                0.319
## 4 QUEENS                   0.324
## 5 STATEN ISLAND            0.247
res_felony_rate = aov(felony_rate ~ factor(borough), data = anova_table)
summary(res_felony_rate)
##                  Df Sum Sq Mean Sq F value Pr(>F)
## factor(borough)   4 0.3743 0.09358   260.4 <2e-16 ***
## Residuals       400 0.1438 0.00036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pairwise.t.test(anova_table$felony_rate, anova_table$borough, p.adjust.method = 'bonferroni')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: anova_table$felony_rate and anova_table$borough
##
##               BRONX   BROOKLYN MANHATTAN QUEENS
## BROOKLYN      < 2e-16 -        -         -
## MANHATTAN     3.6e-12 0.0017   -         -
## QUEENS        < 2e-16 0.3439   0.9615    -
## STATEN ISLAND < 2e-16 < 2e-16  < 2e-16   < 2e-16
##
## P value adjustment method: bonferroni
The ANOVA rejects the hypothesis that the five boroughs have equal mean felony rates (F = 260.4, p < 2e-16). The pairwise t tests further show that the Bronx and Staten Island each differ from all other boroughs, and that Manhattan differs from Brooklyn (p = 0.0017) but not from Queens (p = 0.9615); Brooklyn does not differ from Queens either (p = 0.3439). Ranked by mean felony rate from highest to lowest, the boroughs are Brooklyn, Queens, Manhattan, the Bronx, and Staten Island.
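The p-values in the pairwise table are Bonferroni-adjusted: with five boroughs there are 10 pairwise comparisons, so each unadjusted p-value is multiplied by 10 and capped at 1. The sketch below illustrates the adjustment on hypothetical unadjusted p-values (`raw_p` is made up for illustration and is not taken from the analysis) and checks the manual calculation against `p.adjust()`.

```r
# Hypothetical unadjusted p-values, one per pairwise comparison among 5 groups
raw_p <- c(0.0001, 0.00017, 0.004, 0.03439, 0.09615, 0.2, 0.5, 1e-10, 1e-12, 0.8)
n_comparisons <- choose(5, 2)  # number of distinct pairs among 5 groups

# Bonferroni: multiply by the number of comparisons and cap at 1
manual_adj  <- pmin(1, raw_p * n_comparisons)
builtin_adj <- p.adjust(raw_p, method = "bonferroni")

print(data.frame(raw_p, manual_adj, builtin_adj))
```

Because the adjustment multiplies every p-value by the number of comparisons, it is conservative: a pair that looks different before adjustment can become non-significant afterwards.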
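The tests above show that high monthly cases do not necessarily mean a high felony rate: Queens ranks fourth in monthly cases but second in felony rate, while Manhattan ranks second in monthly cases but third in felony rate (although the felony rates of Brooklyn, Queens, and Manhattan are close, and not all of their pairwise differences are significant). At the extremes the two measures agree: Brooklyn has both the highest monthly cases and the highest felony rate, and Staten Island has both the lowest monthly cases and the lowest felony rate.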
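The two rankings can be compared directly. The sketch below copies the borough means from the two summary tables above into a small data frame (the data frame `boroughs` and its rank columns are introduced here for illustration), ranks the boroughs on each measure, and computes a Spearman rank correlation. With only five boroughs this is a descriptive summary, not a formal test.

```r
# Borough means copied from the two summary tables above
boroughs <- data.frame(
  borough       = c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"),
  monthly_cases = c(8316, 11060, 9423, 7843, 1656),
  felony_rate   = c(0.296, 0.330, 0.319, 0.324, 0.247)
)

# Rank 1 = highest value on each measure
boroughs$rank_cases <- rank(-boroughs$monthly_cases)
boroughs$rank_rate  <- rank(-boroughs$felony_rate)

# Spearman rank correlation between the two measures
rho <- cor(boroughs$monthly_cases, boroughs$felony_rate, method = "spearman")

print(boroughs[order(boroughs$rank_cases), ])
print(rho)  # 0.7
```

The correlation of 0.7 reflects agreement at the top and bottom of the ranking (Brooklyn and Staten Island) and disagreement in the middle (Manhattan, the Bronx, and Queens).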
# Interactive map of 2022 complaints, colored by borough, with details on hover
NYPD_plot =
  complaint %>%
  mutate(
    text_label = str_c("Borough: ", borough, "\nPrecinct: ", precinct, "\nLevel: ", level, "\nOffense: ", offense)) %>%
  filter(year == 2022) %>%
  plot_ly(
    lat = ~latitude,
    lon = ~longitude,
    type = "scattermapbox",
    mode = "markers",
    alpha = 0.2,
    color = ~ borough,
    text = ~text_label) %>%
  layout(
    mapbox = list(
      style = "carto-positron",
      zoom = 9,
      center = list(lon = -73.9, lat = 40.7)
    )
  )
NYPD_plot